Vikum Wijesinghe - September 2019
Other Kernels: https://www.kaggle.com/vikumsw/kernels


Here they are! Let's welcome SpiderMan and IronMan. Today we are joining a class where IronMan teaches Data Exploration to SpiderMan. Now let's be quiet and listen in.
Iron Man : Hi Spidy, tell me what you want to learn.
SpiderMan : I was studying data exploration with Python, but a few minutes in I was already confused. I know there is a huge collection of tools out there for Data Exploration: Matplotlib, Seaborn, ggplot, Bokeh, Plotly, Pygal, Altair, Geoplotlib, Gleam, Missingno and more. But which one should I use? How do I use it? This broad availability itself creates confusion.
Iron Man : OK, I get it. So what do you expect from this session?
SpiderMan : What are the things we do in EDA? I mean, is there a sequence of tasks? What tools to use? How to use them?... and...
Iron Man : Okay, I get it. It's best to explain the process while practicing. For demonstration let's use the two best-known datasets on Kaggle: the housing dataset and the Titanic dataset. Are you ready, Spidy?
SpiderMan : More than ever, boss!
Iron Man : Let's get the party started!

Iron Man : We need party people here... Let's invite them.
#Inviting Party People
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import os
print(os.listdir("../input"))
#Load datasets for demonstrations
titanic_data = pd.read_csv("../input/titanic/train.csv")
house_data = pd.read_csv("../input/house-prices-advanced-regression-techniques/train.csv")
Iron Man : First and foremost, it is important to have a look at the data to get a clear sense of what we are working with.
Let's get an idea about the data using the following pandas functions.
#look at first 5 rows using .head()
house_data.head()
#Wanna see more?. try -> house_data.head(13) for first 13 rows.
#look at last 5 rows using .tail()
house_data.tail()
#dimensions (rows, columns) using .shape
house_data.shape
#column names using .columns
house_data.columns
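Alongside .shape and .columns, .dtypes gives a quick sense of the numeric/categorical split. A minimal sketch on a tiny toy frame standing in for house_data (the real one has 1460 rows and 81 columns):

```python
import pandas as pd

# Toy frame standing in for house_data
df = pd.DataFrame({
    'SalePrice': [208500, 181500, 223500],
    'GrLivArea': [1710, 1262, 1786],
    'Neighborhood': ['CollgCr', 'Veenker', 'CollgCr'],
})

# Which columns are numeric and which are object (usually categorical)?
print(df.dtypes)

# Count of columns per dtype -- a one-line summary of that split
print(df.dtypes.value_counts())
```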
SpiderMan : I see. Our dataset has 1460 rows and 81 columns. That is how to start, right? But that's a lot of data. Looks a bit confusing...
Iron Man : That's true. So let's focus on a single feature and study it. That's called Univariate Analysis. I'll read that part for you. Listen carefully.
Okay! Time to look at the problem of analysing a numerical feature. Say you are given a numerical feature named 'SalePrice' and you are expected to explore it. What would you do? Feeling confused or incompetent? Don't worry, it's very simple once you get it right.
What is numerical data? First, let's be clear about what numerical data is: numerical data have meaning as a measurement, such as a person's height, weight, IQ, or blood pressure; or they are a count, such as the number of stock shares a person owns, how many teeth a dog has, or how many pages of your favorite book you can read before you fall asleep. Statisticians also call numerical data quantitative data.
Two Types of Numerical Data : Discrete Data & Continuous Data
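A rough heuristic for spotting that split in pandas: integer columns often hold discrete counts, while float columns often hold continuous measurements. It is only a heuristic (an integer-coded rating is not really a count), as this sketch with made-up columns shows:

```python
import pandas as pd

# Toy frame: a count (discrete) and a measurement (continuous)
df = pd.DataFrame({
    'rooms': [3, 4, 2],               # discrete: you can't have 2.5 rooms
    'area_m2': [81.5, 120.0, 64.25],  # continuous: any value in a range
})

# dtype-based guess at discrete vs continuous -- a heuristic, not a rule
discrete_like = df.select_dtypes(include='int64').columns.tolist()
continuous_like = df.select_dtypes(include='float64').columns.tolist()
print(discrete_like, continuous_like)
```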
What are we looking for? : What would help us develop an understanding of a numerical feature? There is a basic set of stats & figures we use, presented below.
How to? : Life is much easier when what you want is only a few lines of code away. There are a few basic things we can do.
Let's use the 'SalePrice' column of the housing dataset for demonstration. This feature contains the sale prices of houses. That's all we know... for now.. :)
#Peek... head or tail
house_data['SalePrice'].head()
SpiderMan : Ah, interesting! A numerical feature with big values.
# Descriptive statistics summary
house_data['SalePrice'].describe()
SpiderMan : That's some valuable info from just one line of code! Let me put it into words... We got 1460 values with a mean of approximately 180921 and a standard deviation of 79442.5. The minimum value observed is 34900, while the maximum is 755000.
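The quartiles in the describe() output can be pushed a little further: the interquartile range (IQR) and the classic 1.5×IQR fence give a quick outlier check. A sketch with made-up prices standing in for house_data['SalePrice']:

```python
import pandas as pd

# Made-up prices standing in for house_data['SalePrice']
prices = pd.Series([34900, 129500, 163000, 214000, 755000])

q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1                 # interquartile range: spread of the middle 50%
upper_fence = q3 + 1.5 * iqr  # common rule-of-thumb cutoff for high outliers

outliers = prices[prices > upper_fence]
print(iqr, upper_fence)       # 84500.0 340750.0
print(outliers.tolist())      # [755000]
```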
from scipy.stats import norm
# Distribution plot
def distribution_plot(data):
    sns.distplot(data, fit=norm)
    plt.ylabel('Frequency')
    plt.title(f'{data.name} distribution')

distribution_plot(house_data['SalePrice'])
#skewness and kurtosis
print("Skewness: %f" % house_data['SalePrice'].skew())
print("Kurtosis: %f" % house_data['SalePrice'].kurt())
SpiderMan : The distribution plot looks interesting. Even though the values spread from 34900 to 755000, most of them fall between 100000 and 200000.
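A strongly right-skewed feature like this is often log-transformed before modelling, which should pull the skewness toward 0. A sketch with synthetic right-skewed data standing in for SalePrice:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed (lognormal) data standing in for SalePrice
prices = pd.Series(np.exp(rng.normal(12, 0.4, 1000)))

print("before:", prices.skew())   # clearly positive (right-skewed)
logged = np.log1p(prices)         # log(1 + x); same idea as np.log here
print("after: ", logged.skew())   # much closer to 0
```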
Iron Man : That is a univariate study of a numerical feature. Now let's see what to do when we get a categorical variable. Like before... listen carefully...
Next, let me introduce you to the problem of analysing a categorical feature. Say you are given a categorical feature named 'OverallQual' (overall material and finish quality) and you are expected to explore it. What would you do? Feeling confused or incompetent? Don't worry, it's very simple once you get it right.
What is categorical data? First, let's be clear about what categorical data is: categorical data represent characteristics such as a person's gender, marital status, hometown, or the types of movies they like. Categorical data can take on numerical values (such as "1" indicating male and "2" indicating female), but those numbers don't have mathematical meaning; you couldn't add them together, for example. Statisticians also call categorical data qualitative data.
Two Types of Categorical Data : Nominal Data & Ordinal Data
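pandas can encode the nominal/ordinal distinction directly: an ordered CategoricalDtype makes comparisons and min/max respect the declared order. A sketch with hypothetical quality labels:

```python
import pandas as pd

# Hypothetical quality labels, low to high
quality = pd.Series(['Fair', 'Good', 'Excellent', 'Good'])

# Nominal: categories with no order
nominal = quality.astype('category')

# Ordinal: the same categories, with a declared order
ordinal = quality.astype(pd.CategoricalDtype(['Fair', 'Good', 'Excellent'], ordered=True))

print(ordinal.min(), ordinal.max())  # order-aware: Fair Excellent
print((ordinal > 'Fair').tolist())   # comparisons work only on the ordered dtype
```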
What are we looking for? : What would help us develop an understanding of a categorical feature? There is a basic set of stats & figures we use, presented below.
How to? : Life is much easier when what you want is only a few lines of code away. There are a few basic things we can do.
Let's use the 'OverallQual' column of the housing dataset for demonstration. This feature holds the overall material and finish quality of each house. That's all we know... and need to know for now.. :)
OverallQual = house_data['OverallQual'].astype('category')
#Peek... head or tail
OverallQual.head()
SpiderMan : Ah, that is some useful info... the OverallQual feature has 10 categories.
# Descriptive statistics summary
OverallQual.describe()
SpiderMan : I see we are getting answers to our questions fast. There are 10 categories; the most frequent is '5', with a count of 397 out of 1460.
column = OverallQual
print('Column Name:{}\nCardinality:{}\nValues:{}'.format(column.name,column.nunique(), column.unique()))
OverallQual.value_counts()
def getPlotsforCatFeature(series, figX=15, figY=7):
    f, ax = plt.subplots(1, 2, figsize=(figX, figY))
    series.value_counts().plot.pie(autopct='%1.1f%%', ax=ax[0])
    ax[0].set_title(f'{series.name}')
    ax[0].set_ylabel('')
    sns.countplot(series, ax=ax[1])
    ax[1].set_title(f'Count plot for {series.name}')
    plt.show()
getPlotsforCatFeature(OverallQual,15,5)
SpiderMan : Those are the value counts for each category. Category '5' ranks first with a count of 397, closely followed by '6' and '7' with counts of 374 and 319 respectively. The lowest counts come from categories '1' and '2', with just 5 observations even when combined. Great!...
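The percentages on the pie chart come straight from value_counts(normalize=True). A sketch on a toy column standing in for OverallQual:

```python
import pandas as pd

# Toy column standing in for OverallQual
qual = pd.Series([5, 5, 6, 7, 5, 6])

counts = qual.value_counts()                # absolute counts, descending
shares = qual.value_counts(normalize=True)  # same ranking, as fractions summing to 1

print(counts.to_dict())                     # {5: 3, 6: 2, 7: 1}
print((shares * 100).round(1).to_dict())    # percentages, as on the pie chart
```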
I think I have an idea of univariate analysis now. But what about the relationships between features?

Iron Man : That's what we look for next. Let's analyse feature pairs at once, looking for relationships. It's called Bivariate Analysis.
Most often we are curious about how two numerical features behave with respect to each other. The following techniques help us develop insights into those hidden relationships.
For demonstration, let's use 'GrLivArea' (ground living area) and 'SalePrice' from the housing dataset.
#scatter plot
house_data.plot.scatter(x='GrLivArea', y='SalePrice');
''' Alternatively you could use the following function
def scatterplot(seriesX, seriesY):
    data = pd.concat([seriesY, seriesX], axis=1)
    data.plot.scatter(x=seriesX.name, y=seriesY.name)

scatterplot(house_data['GrLivArea'], house_data['SalePrice'])
'''
SpiderMan : It seems 'SalePrice' and 'GrLivArea' are good friends, with a roughly linear relationship.
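The "good friends" impression from the scatter plot can be quantified with Pearson's correlation coefficient via Series.corr. A sketch on toy columns standing in for GrLivArea and SalePrice:

```python
import pandas as pd

# Toy columns standing in for GrLivArea and SalePrice
area = pd.Series([850, 1200, 1710, 2200, 2600])
price = pd.Series([90000, 135000, 208000, 250000, 310000])

# Pearson's r; close to +1 means a strong positive linear relationship
r = area.corr(price)
print(round(r, 3))
```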
Iron Man : Nice, you are getting it. Next, let's see how to find relationships between a numerical feature and a categorical feature.
Let's try to visualize the relationship between a numerical feature and a categorical feature. We'll use SalePrice as the numerical feature and OverallQual (overall material and finish quality) as the categorical feature from the housing dataset. I know what you are thinking... we expect sale price to increase with overall quality. Let's see whether the following techniques show that.
#Box plot
num = 'SalePrice'
cat = 'OverallQual'
df = house_data
data = pd.concat([df[num], df[cat]], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x=cat, y=num, data=data)
fig.axis(ymin=0, ymax=800000);
SpiderMan : How beautiful! Just as we expected! SalePrice increases with OverallQual (overall material and finish quality). Shall we do this to analyse the relationship with SalePrice for a few more categorical columns?
def boxplot(x, y, **kwargs):
    sns.boxplot(x=x, y=y)
    plt.xticks(rotation=90)

def fillMissingCatColumns(data, categorical):
    for c in categorical:
        data[c] = data[c].astype('category')
        if data[c].isnull().any():
            data[c] = data[c].cat.add_categories(['MISSING'])
            data[c] = data[c].fillna('MISSING')

def getboxPlots(data, var, categorical):
    fillMissingCatColumns(data, categorical)
    f = pd.melt(data, id_vars=var, value_vars=categorical)
    g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False, size=5)
    g = g.map(boxplot, "value", var)
data = house_data.copy()
categorical = [f for f in data.columns if data.dtypes[f] == 'object']
getboxPlots(data,'SalePrice',categorical)
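The boxplots above have a purely numeric counterpart: group the target by the category and compare medians (the line inside each box). A sketch with a toy frame standing in for house_data:

```python
import pandas as pd

# Toy frame standing in for house_data
df = pd.DataFrame({
    'OverallQual': [3, 3, 5, 5, 8, 8],
    'SalePrice':   [90000, 110000, 150000, 170000, 300000, 340000],
})

# Median sale price per quality level -- the line inside each box
medians = df.groupby('OverallQual')['SalePrice'].median()
print(medians.to_dict())  # {3: 100000.0, 5: 160000.0, 8: 320000.0}
```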
Iron Man : That's what you asked for... Going through each one to identify relationships is your homework. :D Guess what's next...
SpiderMan : mm... Correlation Analysis??
Iron Man : Nope! It's tea time!... 
With some cake!

Large correlations between features are one of the best indicators for feature selection. If we have a dataset with many columns, a good way to quickly check correlations among them is to visualize the correlation matrix as a heatmap.
def getCorrHeatMap(dataFrame, figSize=[12, 9]):
    corrmat = dataFrame.corr()
    f, ax = plt.subplots(figsize=(figSize[0], figSize[1]))
    sns.heatmap(corrmat, vmax=.8, square=True)
getCorrHeatMap(house_data)
Iron Man : When a cell is lighter (closer to white) it indicates a larger positive correlation, whereas darker cells indicate a larger negative correlation.
Iron Man : We are more interested in larger correlations, so we can filter the columns and get a heatmap showing only the features most correlated with 'SalePrice'.
def getZoomedCorrHeatMap(dataFrame, featureCount, target, figSize=[12, 9]):
    corrmat = dataFrame.corr()
    cols = corrmat.nlargest(featureCount, target)[target].index
    f, ax = plt.subplots(figsize=(figSize[0], figSize[1]))
    cm = np.corrcoef(dataFrame[cols].values.T)
    sns.set(font_scale=1.25)
    hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
    plt.show()
getZoomedCorrHeatMap(house_data,10,'SalePrice',[10,8])
The two features with the highest correlations to SalePrice are OverallQual (overall material and finish quality) at 0.79 and GrLivArea at 0.71.
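If you also want a significance check alongside the correlation value, scipy.stats.pearsonr returns both r and a p-value. A sketch on made-up quality/price pairs (not the actual dataset values):

```python
import numpy as np
from scipy.stats import pearsonr

# Made-up data standing in for OverallQual and SalePrice
qual = np.array([3, 4, 5, 6, 7, 8, 9, 10])
price = np.array([95, 120, 150, 175, 230, 280, 360, 450]) * 1000

# r: strength of the linear relation; p: probability of seeing such
# a correlation by chance if the features were actually unrelated
r, p = pearsonr(qual, price)
print(round(r, 2), p < 0.05)
```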
SpiderMan : Wow, great! Is that it? Any more tips?
Iron Man : Yes! Go through the rest of the kernel below... I have noted down some tips for you.
def getMissingValuesInfo(df):
    total = df.isnull().sum().sort_values(ascending=False)
    percent = round(df.isnull().sum().sort_values(ascending=False) / len(df) * 100, 2)
    temp = pd.concat([total, percent], axis=1, keys=['Total Missing Count', '% of Total Observations'])
    temp.index.name = 'Feature Name'
    return temp.loc[temp['Total Missing Count'] > 0]
getMissingValuesInfo(house_data)
# Visualizing missing counts
missing = house_data.isnull().sum()
missing = missing[missing > 0]
missing.sort_values(inplace=True)
plt.subplots(figsize=(15,5))
missing.plot.bar()
plt.show()
fig, ax = plt.subplots(figsize=(20,5))
sns.heatmap(house_data.isnull(), cbar=False, cmap="YlGnBu_r")
plt.show()
White cells show the missing values in the data frame.
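A common follow-up to the missing-value picture is to drop columns that are mostly empty and impute the rest. A sketch with a hypothetical frame; the 50% threshold and the median imputation are judgment calls, not rules:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one column mostly missing, one lightly missing
df = pd.DataFrame({
    'PoolQC':      [None, None, None, 'Gd'],    # 75% missing
    'LotFrontage': [65.0, np.nan, 80.0, 60.0],  # 25% missing
    'SalePrice':   [208500, 181500, 223500, 140000],
})

missing_share = df.isnull().mean()  # fraction missing per column
drop_cols = missing_share[missing_share > 0.5].index
df = df.drop(columns=drop_cols)     # drop mostly-empty columns

# Impute the lightly-missing numeric column with its median
df['LotFrontage'] = df['LotFrontage'].fillna(df['LotFrontage'].median())
print(df.isnull().sum().sum())      # 0 -- no missing values left
```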
def distplots(data, num_features):
    f = pd.melt(data, value_vars=num_features)
    g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False)
    g = g.map(sns.distplot, "value")

# value_vars needs column labels, not a DataFrame, so take .columns here
num_features = house_data.select_dtypes(include=['int64', 'float64']).columns
distplots(house_data, num_features)
num_features = house_data.select_dtypes(include=['int64','float64'])
num_features.describe()
categorical_features = house_data.select_dtypes(include='object')
categorical_features.describe()
def printUniqueValues(df, cardinality=1000):
    n = df.select_dtypes(include=object)
    for column in n.columns:
        uCount = df[column].nunique()
        if uCount <= cardinality:
            print('{:>12}: {} {}'.format(column, uCount, df[column].unique()))

printUniqueValues(house_data, 10)
Thanks to Firath's kernel : https://www.kaggle.com/frtgnn/thorough-eda-with-a-single-line-pandas-profiling/
Pandas Profiling generates profile reports from a pandas DataFrame. The pandas df.describe() function is great but a little basic for serious exploratory data analysis. pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis.
import pandas_profiling
profile_report = pandas_profiling.ProfileReport(titanic_data)
#profile_report.to_file("profile_report.html")
profile_report
# We can use pandas profiling on selected features too.
# Using Pandas Profiling to analyse SalePrice feature in housing dataset.
import pandas_profiling
series = house_data['SalePrice']
d = { series.name : series}
df = pd.DataFrame(d)
pandas_profiling.ProfileReport(df)